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Abstract 

Expectation Propagation (EP) provides a framework for approximate inference. When the 
model under consideration is over a latent Gaussian field, with the approximation being 
Gaussian, we show how these approximations can systematically be corrected. A pertur- 
bative expansion is made of the exact but intractable correction, and can be applied to 
the model's partition function and other moments of interest. The correction is expressed 
over the higher-order cumulants which are neglected by EP's local matching of moments. 
Through the expansion, we see that EP is correct to first order. By considering higher 
orders, corrections of increasing polynomial complexity can be applied to the approxima- 
tion. The second order provides a correction in quadratic time, which we apply to an array 
of Gaussian process and Ising models. The corrections generalize to arbitrarily complex 
approximating families, which we illustrate on tree-structured Ising model approximations. 
Furthermore, they provide a polynomial-time assessment of the approximation error. We 
also provide both theoretical and practical insights on the exactness of the EP solution. 
Keywords: Expectation Consistent inference; Expectation Propagation; perturbation 
correction; Wick expansions; Ising model; Gaussian process 



1. Introduction 

Expectation Propagation (EP) (Opper and Winther, 2000, Minka, 2001a, b) is part of a 
rich family of variational methods, which approximate the sums and integrals required for 
exact probabilistic inference by an optimization problem. Variational methods are perfectly 
amenable to probabilistic graphical models, as the nature of the optimization problem often 
allows it to be distributed across a graph. By relying on local computations on a graph, 
inference in very large probabilistic models becomes feasible. 

Being an approximation, some error may invariably be introduced. This paper is specif- 
ically concerned with the error that arises when a Gaussian approximating family is used, 
and lays a systematic foundation for examining and correcting these errors. It follows on 
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earlier work by the authors (Opper et al., 2009). The error that arises when the free energy 
(the negative logarithm of the partition function or normalizer of the distribution) is ap- 
proximated, may for instance be written as a Taylor expansion (Opper et al., 2009, Paquet 
et al., 2009). A pleasing property of EP is that, at its stationary point, the first order term 
of such an expansion is zero. Furthermore, the quality of the approximation can then be 
ascertained in polynomial time by including corrections beyond the first order, or beyond 
the standard EP solution. In general, the corrections improve the approximation when they 
are comparatively small, but can also leave a question mark on the quality of approximation 
when the lower-order terms are large. 

The approach outlined here is by no means unique in correcting the approximation, as 
is evinced by cluster-based expansions (Paquet et al., 2009), marginal corrections for EP 
(Cseke and Heskes, 2011) and the Laplace approximation (Rue et al., 2009), and corrections 
to Loopy Belief Propagation (Chertkov and Chernyak, 2006, Sudderth et al., 2008, Welling 
et al, 2012). 

1.1 Overview 

EP is introduced in a general way in Section 3, making it clear how various degrees of 
complexity can be included in its approximating structure. The partition function will be 
used throughout the paper to explain the necessary machinery for correcting any moments of 
interest. In the experiments, corrections to the marginal and predictive means and variances 
are also shown, although the technical details for correcting moments beyond the partition 
function are relegated to Appendix D. The Ising model, which is cast as a Gaussian latent 
variable model in Section 2, will furthermore be used as a running example throughout the 
paper. 

The key to obtaining a correction lies in isolating the "intractable quantity" from the 
"tractable part" (or EP solution) in the true problem. This is done by considering the 
cumulants of both: as EP locally matches lower-order cumulants like means and variances, 
the "intractable part" exists as an expression over the higher-order cumulants which are 
neglected by EP. This process is outlined in Section 4, which concludes with two useful 
results: a shift of the "intractable part" to be an average over complex Gaussian variables 
with zero variances, and Wick's theorem, which allows us to evaluate the expectations of 
polynomials under centered Gaussian measures. As a last stage, the "intractable part" is 
expanded in Sections 5 and 6 to obtain corrections to various orders. 

Experimental evidence is presented in Section 7 on Gaussian processes (GP) classifica- 
tion and (non-Gaussian) GP regression models. An insightful counterexample where EP 
diverges under increasing data, is also presented. Ising models are examined in Section 8. 

Numerous additional examples, derivations, and material are provided in the appendices. 
Details on different EP approximations can be found in Appendix A, while corrections 
to tree-structured approximations in are provided in Appendix B. In Appendix C we 
analytically show that the correction to a tractable example is zero. The main body of 
the paper deals with corrections to the partition function, while corrections to marginal 
moments are left to Appendix D. Finally, useful calculations of certain cumulants appear 
in Appendix E. 
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2. Gaussian latent variable models 

Let x = (xi, ... ,Xn) be an unobserved random variable with an intractable distribution 
p(x). In the Gaussian latent variable model (GLVM) considered in this paper, terms t n (x n ) 
are combined over a quadratic exponential /o(x) to give 

n 

with partition function (normalizer) 

Z= I n*n(^)/o(x)dx . 
J n 

This model encapsulates many important methods used in statistical inference. As an 
example, /o can encode the covariance matrix of a Gaussian process (GP) prior on latent 
function observations x n . In the case of GP classification, the terms are typically probit 
link functions 1 , for example 

p(x) = ~l[*MM(x;0,K) . (2) 

n 

The probit function is the standard cumulative Gaussian density $(rr) = J\f(z;0, 1) dz. 
In this example, the partition function is not analytically tractable but for the one-dimensional 
case N = 1. 

An Ising model can be constructed by letting the terms t n restrict x n to ±1 (through 
Dirac delta functions). By introducing the symmetric coupling matrix J and field 6 into 
/o, an Ising model can be written as 



^ X ) = ^II \t{x n + l) + ^5{x n -l) exp jix T Jx + 6» T x| 

n L J k J 



(3) 



In the Ising model, the partition function Z is intractable, as it sums /o(x) over 2^ binary 
values of x. In the variational approaches, the intractability is addressed by allowing approx- 
imations to Z and other marginal distributions, decreasing the computational complexity 
from being exponential to polynomial in N, which is typically cubic for EP. 



3. Expectation Propagation 

An approximation to Z can be made by allowing p(x) in Equation (1) to factorize into a 
product of factors f a . This factorization is not unique, and the structure of the factor- 
ization of p(x) defines the complexity of the resulting approximation, resulting in different 
structures in the approximating distribution. Where GLVMs are concerned, a natural and 
computationally convenient choice is to use Gaussian factors g a , and as such, the approxi- 
mating distribution q(x.) in this paper will be Gaussian. Appendix A summarizes a number 
of factorizations for Gaussian approximations. 

1. The dependence on the class label in $(•) is omitted for clarity. 
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The tractability of the resulting inference method imposes a pragmatic constraint on 
the choice of factorization; in the extreme case p(x) could be chosen as a single factor and 
inference would be exact. For the model in Equation (1), a three-term product may be 
factorized as (tiX^X^)) which gives the typical GP setup. When a division is introduced 
and the term product factorizes as (ii^X^s)/^)) the resulting free energy will be that of 
the tree-structured EC approximation (Opper and Winther, 2005a). To therefore allow for 
regrouping, combining, splitting, and dividing terms, a power D a is associated with each 
f a , such that 

P(*) = ^ Hfa(*) Da (4) 

a 

with intractable normalization (or partition function) Z = J J"J / (x)° a dx. Appendix A 
shows how the introduction of D a lends itself to a clear definition of tree-structured and 
more complex approximations. 

To define an approximation to p, terms g a , which typically take an exponential family 
form, are chosen such that 

<?(*) = ^ Uto(*) Da (5) 

^1 a 

has the same structure as p's factorization. Although not shown explicitly, f a and g a have 
a dependence on the same subset of variables x a . The optimal parameters of the g a -term 
approximations are found through a set of auxiliary tilted distributions, defined by 

/ \ If <?(x)/a(x)\ , s 

9a (x) = — -— . (6) 

Z a V 5a (x) J 

Here a single approximating term g a is replaced by an original term f a . Assuming that this 
replacement leaves q a still tractable, the parameters in g a are determined by the condition 
that q(x) and all g a (x) should be made as similar as possible. This is usually achieved by 
requiring that these distributions share a set of generalised moments which usually coincide 
with the sufficient statistics of the exponential family. For example with sufficient statistics 
0(x) we require that 

(0(x)> fe = (0(x)) g for all a. (7) 

Note that those factors f a in p(x) which is already in the exponential family, such as the 
Gaussian terms in examples above, can trivially be solved for by setting g a = f a . The 
partition function associated with this approximation is 

Z EP = Z q l[z^. (8) 

a 

The EP algorithm iteratively updates the g a -terms by enforcing q to share moments with 
each of the tilted distributions q a ; on reaching a fixed point all moments match according 
to Equation (7) (Minka, 2001a, b) . Although Z^p is defined in the terminology of EP, other 
algorithms may be required to solve for the fixed point, and Z^p, as a free energy, can 
be derived from the saddle point of a set of self-consistent (moment-matching) equations 
(Opper and Winther, 2005a, van Gerven et al., 2010, Seeger and Nickisch, 2010). Next we 
make EP concrete by applying it to the Ising model which will serve as a running example 
in the paper. After that we conclude the section with a discussion of the interpretation of 
EP. 
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3.1 EP for Ising models 

The Ising model in Equation (3) will be used as a running example throughout this paper. 
To make the technical developments more concrete, we will consider both the iV-variate and 
bivariate cases. The bivariate case can be solved analytically, and thus allows for a direct 
comparison to be made between the exact and approximate solutions. 

We use the factorized approximation as a running example, dividing p(x) in Equation 
(3) into N + 1 factors with /o(x) = exp{^x T Jx + T x} and f n (x n ) = t n (x n ) = \8(x n + 
1) + \5{x n — 1), for n = 1, . . . , N (see Appendix A for generalizations). We consider the 
Gaussian exponential family such that g n (x n ) = exp{A„ix n — hX^x^} and go( x ) = /o( x )- 
The approximating distribution from Equation (5), q(x) oc /o( x ) II^=i 9n( x n)> is thus a full 
multivariate Gaussian density, which we write as q(x) = AA(x; fi, S). 

3.1.1 Moment matching 

The moment matching condition in Equation (7) involves only the mean and variance 
if q(x) fully factorizes according to p(x)'s terms. We therefore only need to match the 
mean and variances of marginals of q(x) and the tilted distribution g n ( x ) hi Equation 
(6). The tilted distribution may be decomposed into a Gaussian and a discrete part as 
<7n( x ) = qn(x-\ n \x n )q n (x n ) , where the vector x\ n consists of all variables apart from x n . We 
may marginalize out x\„ and write q n {x n ) in terms of two parameters: 

q n {x n ) oc - 5(x n + 1) + S(x n - 1) exp [jx n - ^Ax^j , (9) 

" v " v ' 

/n(x)=tn(x n ) oc / dx\ n ?(x)/ffn (x) 

where we dropped the dependency of 7 and A on n for notational simplicity. Through some 
manipulation, the tilted distribution is equivalent to 

qn{x n ) = —^5(x n -l)-\ — —S(x n + l), where m n = tanh(7) = — --(10) 

2 2 e~ + e ' 

This discrete distribution has mean m n and variance 1 — m^. By adapting the parameters 
of g n (x n ) using for example the EP algorithm, we aim to match the mean and variance of 
the marginal q(x n ) (of (/(x)) to the mean and variance of q n (x n ). The reader is referred to 
Section 8 for benchmarked results for the Ising model. 

3.1.2 Analytic bivariate case 

Here we shall compare the exact result with EP and the correction for simplest non-trivial 
model, the N = 2 Ising model with no external field 



p(x) = \ (<y(n - 1) + 5{xx + 1)) (S(x 2 - 1) + 6(x 2 + 1)) e 

In order to solve the moment matching conditions we observe that the mean values must 
be zero because the distribution is symmetric around zero. Likewise the linear term in the 
approximating factors disappears and we can write g n {x n ) = exp{— Ax^/2} and g(x) = 

T A -J r 1 

M (x; 0, S) with I] = . The moment matching condition for the variances, 

— J A 
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1 = S n nj turns into a second order equation with solution A = \ J 2 + \/J 4 + 4 . We 
can now insert this solution into the expression for EP partition function Equation (8) 
expanding the result to the second order in J 2 in the second equality: 

iogZ EP = -i + iv / i+I^-^iogQ(i + v / i+I^)) = y~T + --- ■ 

Comparing with the exact expression 

log Z = log cosh (J) = — - — + . . . 

we see that EP gives the correct J 2 coefficient but the J 4 comes out wrong. In Section 4 
we investigate how cumulant corrections can correct for this discrepancy. 



3.2 Two explanations why Gaussian EP is often very accurate 

EP, as introduced above, is an algorithm. The justification for the algorithm put forward by 
Minka and adopted by others (see for example recent textbooks by Bishop (2006), Barber 
(2012) and Murphy (2012)) is useful for explaining the steps in the algorithm but may 
be misleading in order to explain why EP in many often provides excellent accuracy in 
estimation of marginal moments and Z. 

The general justification for EP (Minka, 2001a, b) is based upon a minimization of 
Kulback-Leiber (KL) divergences. Ideally, one would determine the approximating distri- 
bution q(x) as the minimizer of KL(p||(/) in an exponential family of (in our case, Gaussian) 
densities. Since this is not possible - it would require the computation of exact moments - 
we instead iteratively minimize "local" KL-divergences KL(g a ||g), between the tilted distri- 
bution q a and q, with respect to g a (appearing in q). This leads to the moment matching 
conditions in Equation (7). The argument for this procedure is essentially that this will 
ensure that the approximation q will capture high density regions of the intractable pos- 
terior p. Obviously, this argument cannot be applied to Ising models because the exact 
and approximate distribution are very different with the former being discrete due to the 
Dirac 5- function binary variables x = ±1 priors. So even though the optimization still 
implies moment matching, this discrete-continuous discrepancy makes local KL-divergences 
KL(g a ||<7) infinite] 

In order to justify the usefulness of EP for Ising models we therefore need an alternative 
argument. Our argument is entirely restricted to Gaussian EP for our extended definition of 
Gaussian latent variable models and do not extend to approximations with other exponential 
families. In the following, we will discuss these assumptions in inference approximations that 
have preceded the formulation of EP in order to provide a possible more relevant justification 
of the method. Although this justification is not strictly necessary for practically using EP 
nor corrections to EP, it nevertheless provides a good starting point for understanding both. 

The argument goes back to the mathematical analysis of the Sherrington-Kirkpatrick 
(SK) model for a disordered magnet (a so-called spin glass) (Sherrington and Kirckpatrick, 
1975). For this Ising model, the couplings J are drawn at random from a Gaussian dis- 
tribution. An important contribution in the context of inference for this model (i.e. the 
computations of partition functions and average magnetizations) was the work of Thouless 



G 



Perturbative Corrections for Approximate Inference 



et al. (1977) who derived self-consistency equations which are assumed to be valid with 
a probability (with respect to the drawing of random couplings) approaching one as the 
number of variables x n grows to infinity. These so-called TAP equations are closely related 
to the EP moment matching conditions of Equation (7). The difference being that the 
TAP equations partly rely on the specific assumption of the randomness of the couplings. 
Self-consistency equations equivalent to the EP moment matching conditions which avoided 
such assumptions on the statistics of the random couplings were first derived by Opper and 
Winther (2000) by using a so-called cavity argument (Mezard et al., 1987). A new impor- 
tant contribution of Minka (2001a) was to provide an efficient algorithmic recipe for solving 
these equations. 

We will now sketch the main idea of the cavity argument for the GLVM. Let x\ n ("x 
without n") denote the complement to x n , that is x = x\„ U x n . Without loss of generality 
we will take the quadratic exponential term to be written as /o(x) oc exp (-x T Ux/2). With 
similar definitions of U\„, the exact marginal distribution of x n may be written as 



It is clear that p n (x n ) depends entirely on the statistics of the random variable h n = 
^2 n 'jt n U nn /x n i . This is the total 'field' created by all other 'magnetic moments' x n > in 
the 'cavity' opened once x n has been removed from the system. In the context of densely 
connected models with weak couplings, we can appeal to the central limit theorem 2 to ap- 
proximate h n by a Gaussian random variable with mean j n and variance V n . When looking 
at the influence of the remaining variables x\„, on x n , the non-Gaussian details of their dis- 
tribution has been washed out in the marginalization. Integrating out the Gaussian random 
variable h n gives the Gaussian cavity field approximation to the marginal distribution: 



This is precisely of the form of the marginal tilted distribution q n (x n ) of Equation (9) 
as given by Gaussian EP. In the cavity formulation, q(x) is simply a placeholder for the 
sufficient statistics of the individual Gaussian cavity fields. So we may observe cases, the 
Ising model or bounded support factors being the prime examples, where EP gives essentially 
correct results for the marginal distributions of the x n and of the partition function Z, while 
</(x) gives a poor or even meaningless (in the sense of KL divergences) approximation to the 
multivariate posterior. Note however, that the entire covariance matrix of the Xyi can De 
computed simply from a derivative of the free energy (Opper and Winther, 2005b) resulting 
in an approximation of this covariance by that of <?(x). This may indicate that a good 

2. In the context of sparsely connected other cavity arguments lead to loopy belief propagation. 
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EP approximation of the free energy may also result in a good approximation to the full 
covariance. The near exactness of EP (as compared to exhaustive summation) in Section 8 
therefore shows the central limit theorem at work. Conversely, mediocre accuracy or even 
failure of Gaussian EP, as also observed in our simulations Sections 7.3 and 8, may be 
attributed to breakdown of the Gaussian cavity field assumption. Exact inference on the 
strongest couplings as considered for the Ising model in Section 8 is one way to alleviate 
the shortcoming of the Gaussian cavity field assumption. 

4. Corrections to EP 

The Zep approximation can be corrected in a principled approach, which traces the following 
outline: 

1. The exact partition function Z is re- written in terms of Zep, scaled by a correction 
factor R = ZjZ^p. This correction factor R encapsulates the intractability in the 
model, and contains a "local marginal" contribution by each f a (see Section 4.1). 

2. A "handle" on R is obtained by writing it in terms of the cumulants (to be defined in 
Section 4.2) of q(x) and <? a (x), Equations (5) and (6). As g a ( x ) and g(x) share their 
two first cumulants, the mean and covariance from the moment matching condition 
Equation (7), a cumulant expansion of R will be in terms of higher-order cumulants 
(see Section 4.2). 

3. R, defined in terms of cumulant differences, is written as a complex Gaussian average. 
Each factor f a contributes a complex random variable k a in this average (see Section 



4. Finally, the cumulant differences are used as "small quantities" in a Taylor series 
expansion of R, and the leading terms are kept (see Sections 5 and 6). 

The series expansion is in terms of a complex expectation with zero "self-covariance" ; 
this has two important consequences. Firstly, it causes all first order terms in the Tay- 
lor expansion to disappear, showing that Zep is correct to first order. Secondly, due 
to Wick's theorem (introduced in Section 4.4), these zeros will contract the expansion 
by making many other terms vanish. 

The strategy that is presented here can be re-used to correct other quantities of interest, 
like marginal distributions or the predictive density of new data when p(x) is a Bayesian 
probabilistic model. These corrections are outlined in Appendix D. 

4.1 Exact expression for correction 

We define the (intractable) correction R as Z = RZ^p. We can derive a useful expression 
for R in a few steps as follows: First we solve f a in Equation (6), and substitute this into 
Equation (4) to obtain 



4.3). 



( 



Z a g a (x)ff a (x) 
?(x) 



) 



D, 



'a 




) 



D, 



'a 



n>w D "=n 



Z EP g(x) J 




a 



a 



a 
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We introduce F(x) 

to derive the expression for the correction R = Z/Zep by integrating Equation (12): 

R = j g(x)F(x)dx . (13) 
where we have used Z = J FT fa{*-) Da dx. Similarly we can write: 

P(x) = \ II /a(x) D " = ^ g(x)F(x) = I g(x)F(x) . (14) 

a 

Corrections to the marginal and predictive densities of p(x) can be computed from this 
formulation. This expression will become especially useful because the terms in F(x) turn 
out to be "local" , that is to only depend on the marginals of the variables associated with 
factor a. Let / a (x) depend on the subset x a of x, and let x\ a ("x without a") denote the 
remaining variables. The distributions Equations (5) and (6) differ only with respect to 
their marginals on x a , q a (x a ) and q(x a ), and therefore 

gq(x) = g(x\ a |x a )g Q (x a ) = g a (x a ) 
<?( x ) <?(x\ a |x a )<?(x a ) q(x a ) 

Now we can rewrite F(x) in terms of marginals: 

The key quantity, then, is F, after which the key operation is to compute its expected value. 
The rest of this section is devoted to the task of obtaining a "handle" on F. 



4.2 Characteristic functions and cumulants 

The distributions present in each of the ratios in -F(x) in Equation (15) share their first 
two cumulants, mean and covariance. Cumulants and cumulant differences are formally 
defined in the next paragraph. This simple observation has a crucial consequence: As the 
(7(x a )'s are Gaussian and do not contain any higher order cumulants (three and above), F 
can be expressed in terms of the higher cumulants of the marginals q a (x a ). When the term- 
product approximation is fully factorized, these are simply cumulants of one-dimensional 
distributions. 

Let N a be the number of variables in subvector x a . In the examples presented in 
this work, A^ a is on6 or two. Furthermore, let k a be £in A^ a -d.imensional vector k a — 
(fei, . . . , &/Va) a - The characteristic function of q a is 

Xa (K) = J e lk " x » q a (x a ) dx a = (e 1 ^)^ , (16) 
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and is obtained through the Fourier transform of the density. Inversely, 

^(x a ) = J e-^ Xa (k a ) dk a . (17) 

The cumulants c aa of q a are the coefficients that appear in the Taylor expansion of log Xa(k a ) 
around the zero vector 

oo 

l0gXa(k o )=^£^k«, 

1=1 \a\=l 

and are defined as 

a 



k a = 

Some notation was introduced in the above two equations to facilitate manipulating a 
multivariate series. The vector a = (a±, . . . , aN a ), with aj S No, denotes a multi-index on 
the elements of k a . Other notational conventions that employ a (writing kj instead of k a j ) 
are: 

( 9 Y TT 



dka) = n 



. dk, 3 

For example, when N a = 2, say for the edge-factors in a spanning tree, the set of multi- 
indices a where |q| = 3 are (3,0), (2,1), (1,2), and (0,3). 

There are two characteristic functions that come into play in -F(x) and R in Equation 
(14). The first is that of the tilted distribution, log x a (k a ), and the other is the characteristic 
function of the EP marginal (?(x a ), defined as x(k a ) = ( e a * a ) q . By virtue of matching the 
first two moments, and q(x a ) being Gaussian, 



r a (k a )=log X a(k a )-log X (k a ) = ^i' E ^f k « ( 18 ) 

l>3 \a\=l 

contains the remaining higher-order cumulants where the tilted and approximate distribu- 
tions differ. 

4.2.1 ISING MODEL EXAMPLE 

The cumulant expansion for the discrete distribution in Equation (10) becomes 



logxA) = log J dx n e iknXn q n (x n ) = log 



1 + m au 1 — m _ ik 
2 2 



1 i 1 

= imk n - — (1 - m 2 )k\ - -^{~ 2m + 2m 3 )fc 3 + —(-2 + 8m 2 - 6m 4 )/c 4 + • • • 

(we're compactly writing m for m n ), from which the cumulants are obtained as 

cin = m = — 2 + 8m 2 — 6m 4 

c 2n = 1 - m 2 c 5n = 16m - 40m 3 + 24m 5 

c 3n = -2m + 2m 3 c 6n = 16 - 136m 2 + 240m 4 - 120m 6 . 
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4.3 The correction as a complex expectation 

The expected value of F, which is required for the correction, has a dependence on a product 
of ratios of distributions g a (x a )/g(x a ). In the preceding section it was shown that the con- 
tributing distributions share lower-order statistics, allowing a twofold simplification. Firstly, 
the ratio q a /q will be written as a single quantity that depends on r a , which was introduced 
above in Equation (18). Secondly, we will show that it is natural to shift integration vari- 
ables into the complex plane and that the definition of complex Gaussian random variables 
(meaning that both real and imaginary parts are jointly Gaussian). These complex random 
variables that define the r a 's have a peculiar property: they have a zero self-covariance! 
This property has important consequences in the resulting expansion. This expansion is 
similar in flavor to Edgeworth expansions which are used to correct the asymptotic Gaussian 
statistics of sums of random variables. 

4.3.1 Complex expectations 

Assume that (/(x a ) = M(x a ; fi a , S a ) and f/ a ( x a) share the same mean and covariance, and 
substitute logXa(k a ) = r a {k a ) + logx(k a ) in the definition of q a in Equation (17) to give 

gq(xq) _ / e -^ x "+ r °( k °) x (k a ) dk a 
?(xa) ~ / e-^a x ( ka ) d k a 

Although the k a variables have not been introduced as random variables, we find it natural 
to interpret them as such, because the rules of expectations over Gaussian random variables 
will be extremely helpful in developing the subsequent expansions. We will therefore write 
(7 a (x a )/g(x a ) as an expectation of expr a (k a ) over a density p(k a |x a ) oc e aXa x(k<x)- Using 
logx(k a ) = ifJ-aka — kjE a k a /2, we see that p(k a |x a ) can be viewed as Gaussian, but not 
for real random variables! We have to consider k a as Gaussian random variables with a real 
and an imaginary part: 

p(k a |x Q ) = A/"(k Q ; -iE-^Xa - Ha) , S" 1 ) . (19) 

Coefficients k a are thus shifted into the complex plane, such that the expectation of 
expr a (k a ) is taken over the complex Gaussian distribution k a |x a , which has g(x )'s in- 
verse covariance matrix as its covariance! A second expectation over q(x a ) will reappear 
due to Equation (13); as an illustration, the simplest joint density p(k a \x a ) g(x a ) is shown 
in Figure 1. 

The Edgeworth type expansion arises when a Taylor series expansion is made of exp r a (k a ) 
around zero, and the expectation taken over k a in Equation (19), giving a real- valued series 
in x a . The result is the product of a Gaussian q(x a ) part and a higher-order deviation from 
a Gaussian density. Although not the focus of this work, an example appears in Equa- 
tion (42) in Appendix C. In this work, an Edgeworth expansion is implicitly made around 
each factor's tilted distribution, after which the interactions between these distributions are 
calculated using the expansion. 

To determine R, a further expectation over x is needed after integrating over k a . In the 
following section these variables will be combined into complex random variables to make 
the averages in the expectation easier. 
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Figure 1: Equation (19) shifts k a to the complex plane. In the simplest case the joint density 
p(k\x) q(x) is x ~ M{n, o" 2 ), lk{k) ~ JV(0, a~ 2 ) and equality ^s(k) = -a~ 2 (x - fx). 
It is a flat ellipsoidal pancake that lives in three dimensions: x and the complex 
k plane (tilted ellipsoid). Integrating over x gives the marginal over a complex k 
(upright ellipsoid). The marginal has $s(k) ~ AA(0, c -2 ), and hence k has variance 
(Oft(fc) + i%{k)) 2 ) = a~ 2 - a" 2 = 0. 



Using Equation (19), g(x a ) cancels out in the ratio that appears in F, which simplifies 

to 

^ = (expr a (k a )\ . (20) 

<Z(x a ) \ /k a |x a 

By substituting Equation (20) into Equation (13), R is equal to 

^=<^ x )>x^(x)=(n( ex p^( k ^)^ Xa ) • ( 2i ) 

When x is given, the k a -variables are independent, but when the uncertainty in q(x) is 
taken into account, the k a - variables are zero-mean complex Gaussian 

<k a ) = (<k a ) ka|Xa ) x = (-iS-^Xa - Ma)) x = 

and are coupled with a zero self-covariance! In other words, if Y] a b = cov(x a ,Xfe), the 
covariance cov(k a ,kb) between the variables in the set {k a } is 

COv(k a ,k b ) = ((k a kf> kai) | x ) x + ^^a 1 ((Xa-/^a)(x 6 -/X b ) T ) x I] b - 1 

if a = b 

-E-^V if a/ b ■ [2Z) 
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Term factorization 





Covariance structure of { k a } 
ki l<2 k 3 k 4 

ki 



ki2 

k 23 
k 34 

k 2 
k 3 



k 2 
k 3 
k 4 



<12 k 23 k 34 k 2 k 3 



Figure 2: The covariance between k a for two factorizations of Yln=i^n(x n ): the top il- 
lustration is for t^t^t^, while the bottom illustration is of a tree structure 
(tit2)(t2t3)(t3ti)/t2/t3. The white squares indicate a zero covariance, with the 
diagonal — the self-covariance cov(k a ,k a ) for each factor — being zero. From the 
properties of (22) there are additional zeros in the tree structure's covariance, 
where edge and node factors share variables. The factor fo = go is shadowed in 
grey in the left-hand figures, and can make q(x.) densely connected. 



Figure 2 illustrates the resulting covariance structure for two different factorizations of the 
same distribution. Each factor f a contributes a k a variable, such that the tree-structured 
approximation's covariance matrix will be larger than that of the fully factorized one. 

Section 5 shows that when D a = 1, the above expectation can be written directly over 
{k a } and expanded. In the general case, discussed in Section 6, the inner expectation is 
first expanded (to treat the D a powers) before computing an expectation over {k a }. 

In both cases the expectation will involve polynomials in /c-variables. The expected 
values of Gaussian polynomials can be evaluated with Wick's theorem. 

4.4 Wick's theorem 

Wick's theorem provides a useful formula for mixed central moments of Gaussian variables. 
Let k ni , . . . , k ne be real or complex centered jointly Gaussian variables, noting that they do 
not have to be different. Then 



(kn 1 ---k ne )=Y,U{ki v k jv) > ( 23 ) 
v 
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where the sum is over all partitions of {ni, . . . , n^} into disjoint pairs j v }. If i = 2m is 
even, then there are (2m)!/(2 m m!) = (2m — 1)!! such partitions. 3 If t is odd, then there are 
none, and the expectation in Equation (23) is zero. 

Consider the one-dimensional variable k ~ M(k\ 0, a 2 ). Wick's theorem states that 
(k e ) = (£-l)\\a e if £ is even, and (k e ) = OiHisodd. In other words, (k 3 ) = 0, (k 4 ) = 3(cr 2 ) 2 , 
(k 6 ) = 15(<7 2 ) 3 , and so forth. 

5. Factorized approximations 

In the fully factorized approximation, with f n (x n ) = t n (x n ), the exact distribution in Equa- 
tion (14) depends on the single node marginals -F(x) = f] n Qn(x n )/q(x n ). Following Equa- 
tion (21), the correction to the free energy 



R = ( JJ Uxpr n (k n 



exp 



^ 1 r n(kn) 



(24) 



is taken directly over the centered complex- valued Gaussian random variables k = (&i, 
which have a covariance 



{k m k n / 







if m = n 
if m n 



(25) 



In the section to follow, all expectations shall be with respect to k, which will be dropped 
where it is clear from the context. 

Thus far, R is re-expressed in terms of site contributions. The expression in Equation 
(24) is exact, albeit still intractable, and will be treated through a power series expan- 
sion. Other quantities of interest, like marginal distributions or moments, can similarly be 
expressed exactly, and then expanded (see Appendix D). 



5.1 Second order correction to log it! 

Assuming that the r n are small, Equation (24) is expanded and the lower order terms kept: 

1 



log R = log ( exp 



Tnikr* 



+ 




+ 



(26) 



The simplification in the second line is a result of the variance terms being zero from 
Equation (25). The single marginal terms also vanish (and hence EP is correct to first 
order) because both (k n ) = and (/c 2 ) = 0. 

This result can give us a hint in which situations the corrections is expected to be small: 

• Firstly, the r n could be small for values of k n where the density of k is not small. 
For example, under a zero noise Gaussian process classification model, q n (x n ) equals 
a step function t n (x n ) times a Gaussian, where the latter often has small variance 
compared to the mean. Hence, q n (x n ) should hence be very close to a Gaussian. 



3. The double factorial is (2m - 1)!! = (2m - 1) x (2m - 3) x (2m - 5) x ■ ■ ■ 1. 
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• Secondly, for systems with weakly (posterior) dependent variables x n we might expect 
that the log partition function logZ would scale approximately linearly with N, the 
number of variables. Since terms with m = n vanish in the computation of lni?, 
there are no corrections that are proportional to N when S mn is sufficiently small as 
N — > oo. Hence, the dominant contributions to logZ should already be included in 
the EP approximation. However, Section 7.3 illustrates an example where this need 
not be the case. 

The expectation (r m r n ), as it appears in Equation (26), is treated by substituting r n with 
its cumulant expansion r n {k n ) — ^]/>3 ^ ^-ink^ n jl\ from Equation (18). Wick's theorem now 
plays a pivotal role in evaluating the expectations that appear in the expansion: 



(r m (k m )r n (k n )} — i l+s — {k s m k l n ) 



l,s>3 



/l y y 

^2 v - \t-"m,m' L -"nn 

The second line above follows from contractions in Wick's theorem. All the self-pairing 
terms, when for example one of the I fc n 's is paired with another k n in Equation (23), are 
zero because (k%) = 0. To therefore get a non-zero result for (k^k^), using (23), each 
factor k n has to be paired with some factor k m , and this is possible only when I = s. Wick's 
theorem sums over all pairings, and there are l\ ways of pairing a k n with a k m , giving the 
result in Equation (27). Finally, plugging Equation (27) into Equation (26) gives the second 
order correction 

+ ■ (28) 

m^n l>3 



J mm / -"nn 



5.1.1 ISING EXAMPLE CONTINUED 

We can now compute the second order log R correction for the N = 2 Ising model example 
of Section 3.1. The covariance matrix has S nn = 1 from moment matching and £12 = 
J/(A 2 - J 2 ) with A = \ J 2 + VJ 4 + 4 . The uneven terms in the cumulant expansion 
derived in Section 4.2.1 disappears because m = 0. The first nontrivial term is therefore 
I = 4 which gives a contribution of | x 2 x ^T,f 2 = ^12 = |^12- I n Section 3.1, we 
saw that log Z — log Zgp = ^- plus terms of order J and higher. To lowest order in J we 



6 

have £12 = J and thus log R = 4r- which exactly cancels the lowest order error of EP. 



5.2 Corrections to other quantities 

The schema given here is applicable to any other quantity of interest, be it marginal or 
predictive distributions, or the marginal moments of p(x). The cumulant corrections for the 
marginal moments are derived in Appendix D; for example, the correction to the marginal 
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mean /Zj of an approximation g(x) = jV(x; /x, S) is 



P (x) Mi -2^ s - n 



+ 



while the correction to the marginal covariance is 



E 2 . 

l>3 j^n 33 



11 



Ej j E nn 



Ejj Sj/ n cijCi n i Ej r 



+ EE 



y • ■ V /I y..v 

.7.7 1 • V ^nn 



i-1 



+ 



(29) 



(30) 



6. General approximations 

The general approximations differ from the factorized approximation in that an expansion 
in terms of expectations under {k a } doesn't immediately arise. Consider R in Equation 
(21): Its inner expectations are over k a |x, and outer expectations are over x. First take the 
binomial expansion of the inner expectation, and keep it to second order in r a : 



/n( k «) 



k a |x 



1 



1 + (r a > + 7T (rl) + 



A, 



1 + A, 



(r a ) + \ (rl) + 



+ 



D a (D a - 1) 



(r a ) + ^ ( r l) + ■■■ 



+ ••• 



i i n / \ i ^ a / 2\ i D a (D a — 1) 2 
1 + -Da (r a > + — (r a ) + (r a ) + 



Notice that r a (k a ) can be complex, but (^(ka)^ i x > a s it appears in the above expansion, 
is real- valued. Using this result, again expand (Y[ a ( era )k a a |x) ■ ^he correction to log/?, up 
to second order, is 

lo S R = \ E D * D <> (( r »( k «))k a |x <^(k )> kb | x ) x + \Y, D a (D a - 1) (<r a (ka))L|x) x + " " ■ 

(31) 

In the above relation the first-order terms all disappeared as ((r a (k a ))) = 0. Terms involving 
((r a (k a ) 2 )) = similarly disappear, as every polynomial in the expansion r a (k a ) 2 averages 
to zero. This is a general case of Equation (26), in which D n = 1 for all factors. In Appendix 
B we show how to use the general result for the case where the factorization is a tree and 
our factors are edges (pairs) and nodes (single variables). 



7. Gaussian process results 

One of most important applications of EP is to statistical models with Gaussian process 
(GP) priors, where x is a latent variable with Gaussian prior distribution with a kernel 
matrix K as covariance E[xx T ] = K. 
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log lengthscale 



Figure 3: A comparison of logi?/logZEp using a perturbation expansion of (28) against 
Monte Carlo estimates of log R/ log Zep, using the USPS data set from (Kuss 
and Rasmussen, 2005). The second order correction to logR, with I = 3, 4, is 
used on the left; the right plot uses a Monte Carlo estimate of log it!. 



It is well known that for many models, like GP classification, inference with EP is on 
par with MCMC ground truth (Kuss and Rasmussen, 2005). Section 7.1 underlines this 
case, and shows corrections to the partition function on the USPS data set over a range of 
kernel hyperparameter settings. 

A common inference task is to predict the output for previously unseen data. Under a 
GP regression model, a key quantity is the predictive mean function. The predictive mean is 
analytically tractable when the latent function is corrupted with Gaussian noise to produce 
observations y n . This need not be the case; in Section 7.2 we examine the problem of 
quantized regression, where the noise model is non-Gaussian with sharp discontinuities. We 
show practically how the corrections transfer to other moments, like the predictive mean. 
Through it, we arrive at a hypothetical rule of thumb: if the data isn't "sensible" under the 
(probabilistic) model of interest, there is no guarantee for EP giving satisfactory inference. 

Armed with the rule of thumb, Section 7.3 constructs an insightful counterexample 
where the EP estimate diverges or is far from ground truth with more data. Divergence 
in the partition function is manifested in the initial correction terms, giving a test for the 
approximation accuracy that doesn't rely on any Monte Carlo ground truth. 

7.1 Gaussian process classification 

The GP classification model arises when we observe N data points s n with class labels 
y n 6 {—1, 1}, and model y through a latent function x with a GP prior. The likelihood 
terms for y n are assumed to be t n {x n ) = $>(y n x n ), where $(•) denotes the cumulative 
Normal density. 

An extensive MCMC evaluation of EP for GP classification on various data sets was 
given by Kuss and Rasmussen (2005), showing that the log marginal likelihood of the data 
can be approximated remarkably well. As shown by Opper et al. (2009), an even more 
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accurate estimation of the approximation error is given by considering the second order 
correction in Equation (28). For GPC we generally found that the I = 3 term dominates 
I = 4, and we do not include any higher cumulants here. 

Figure 3 illustrates the relative size of the correction, log Rj log -^ep, on the binary 
subproblem of the USPS 3's vs. 5's digits data set, with N = 767. This is the same set- 
up of Kuss and Rasmussen (2005) and Opper et al. (2009), using the kernel k(s, s') = 
cr 2 exp(— ^||s — s'\\ 2 /£ 2 ), and we refer the reader to both papers for additional and com- 
plimentary figures and results. We evaluated Equation (28) on a similar grid of log£ and 
logo" values. For the same grid values we obtained Monte Carlo estimates of logZ, and 
hence log R. The relative correction is remarkably small, underlining Kuss and Rasmussen 
(2005) 's findings on the accuracy of EP for GPC. 

The correction from Equation (28), as computed here, is 0(N 2 ), and compares favourably 
to 0(N 3 ) complexity of EP for GPC. 

7.2 Uniform noise regression 

We turn our attention to a regression problem, that of learning a latent function x(s) 
from inputs {s n } and matching real- valued observations {y n }- A frequent nonparametric 
treatment assumes that x(s) is a priori drawn from a GP prior with covariance function 
A;(s, s'), from which a corrupted version y is observed. Analytically tractable inference is no 
longer possible in this model when the observation noise is non-Gaussian. Some scenarios 
include that of quantized regression, where y n is formed by rounding x(s n ) to, say, the 
nearest integer, or where x(s) indicates a robot's path in a control problem, with conditions 
to stay within certain "wall" bounds. In these scenarios the latent function x(s n ) can be 
reconstructed from y n by adding sharply discontinuous uniformly random U[— a, a] noise, 



We now assume an EP approximation g(x) = A/"(x;/x, XI), which can be obtained by 
using the moment calculations in Appendix E.2. To simplify the exposition of the predictive 
marginal, we follow the notation of Rasmussen and Williams (2005, Chapter 3) and let \ n = 
{j n ,v n ), so that the final EP approximation multiplies g n terms Y\ n ex P{ — i r n x n + u nXn\ 
into a joint Gaussian A^(x ; 0, K). 

7.2.1 Making predictions for new data 

The latent function x(s^) at any new input s* is obtained by the predictive marginal q(x*) 
of q(x, x*). The marginal q{x if ) — given below in Equation (33) — is directly obtained from 
the EP approximation q(x.) = J\f(x;fi, S). However, the correction to its mean, as was 
given in Equation (29), requires covariances £* n , which are derived here. 

Let = fc(s*, s*), and k* be a vector containing the covariance function evaluations 
&;(s*,s n ). Again following Rasmussen and Williams (2005) 's notation, let S be the diagonal 
matrix containing l/r n along its diagonal. The EP covariance, on the inclusion of x*, is 
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£ k* — K(K + E) _1 k* 

k^-k^K + S)-^ ^-k^K + S)- 1 ^ 



(32) 



with 5] = K — K(K + S) _1 K. There is no observation associated with s*, hence r* = 
in the first line above, and its inclusion has q* = for I > 3. The second line follows by 
computing matrix partitioned inverses twice on S*. The joint EP approximation for any 
new input point s* is directly obtained as 



According to Equation (29), one needs the covariances X]* n to correct the marginal's 
mean; they appear in the last column of in Equation (32). The cumulants for this 
problem, used both for EP and correcting it, are derived in Appendix E.2. 

7.2.2 Predictive corrections 

In Figure 4 we investigate the predictive mean correction for two cases, one where the data 
cannot be expected to appear the prior, and the other where the prior is reasonable. For 
s E M, the values of x(s*) are predicted using a GP with squared exponential covariance 
function k(s,s') = #exp(-±(s - s') 2 /£). 

In the first instance, the prior amplitude 8 and lengthscale i are deliberately set to 
values that are too big; in other words, a typical sample from the prior would not match 
the observed data. We illustrate the posterior marginal q(x*), and using Equations (29) 
and (30), show visible corrections to its mean and variance 4 . For comparison, Figure 4 
additionally shows what the predictive mean would have been were {y n } observed under 
Gaussian noise with the same mean and variance as U[—a,a\: it is substantially different. 

In the second instance, log^Ep is maximized with respect to the covariance function 
hyperparameters 6 and i to get a kernel function that more reasonably describes the data. 
The correction to the mean of g(s*) is much smaller, and furthermore, generally follows the 
"Gaussian noise" posterior mean. When the observed data is not typical under the prior, 
the correction to (x*) is substantially bigger than when the prior is representative of the 
data. 

7.2.3 Underestimating the truth 

Under closer inspection, the variance in Figure 4 is slightly underestimated in regions where 
there are many close box constraints \x n — y n \ < a. However, under sparser constraints 
relative to the kernel width, EP accurately estimates the predictive mean and variance. In 
Figure 5 this is taken further: for N = 100 uniformly spaced inputs s € [0, 1], it is clear 

4. In the correction for the mean in Equation (29), we used I = 3 and I = 4 in the second order correction. 
For the correction to the variance in Equation (30), we used I = 3 in the first sum, and I = 3 and I — 4 
in the second sum 




with the marginal q(x*) being 



q(x t ) = M(x, ; k^K-V, ^ ~ k^(K + S)" 1 ^) . 



(33) 
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Figure 4: Predicting x(s*) with a GP. The yellow bars indicate the permissible x(s n ) values; 

they are linked to observations y n through the uniform likelihood l[\x n — y n \ < a]. 
Due to the U[— a, a] noise model, q(x*) is ambivalent to where in the "box" x(s*) 
is placed. A second order correction to the mean of q(x*) is shown in a dotted 
line. The green function plots p(x*), if the likelihood was also Gaussian with 
variance matching that of the "box" . In the top figure both the prior amplitude 
9 and lengthscale I are overestimated. In the bottom figure, 9 and t were cho- 
sen by maximizing log Zep with respect to their values. Notice the smaller EP 
approximation error. 



that g(x) becomes too narrow. The second order correction, on the other hand, provides a 
much closer estimate to the ground truth. 
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Figure 5: Predicting x(s*) with a GP with k(s,s') = exp{ — \s — s'\/2£} and I = 1. In 
the left figure logi? = 0.41, while the second order correction estimates it as 
log R ~ 0.64. On the right, the correction to the variance is not as accurate as 
that on the left. The right correction is log it! = 0.28, and its discrepancy with 
log R ~ 0.45 (EP+corr) is much bigger. 



One might inquire about the behavior of the EP estimate as N — > oo in Figure 5. In 
the next section, this will be used as a basis for illustrating a special case where log Zep 
diverges. 

7.3 Gaussian process in a box 

In the following insightful example — a special case of uniform noise regression — log Zep 
diverges from the ground truth with more data. Consider the ratio of functions x(s) over 
[0, 1], drawn from a GP prior with kernel k(s, s'), such that x(s) lies within the [—a, a] box. 
Figure 6 illustrates three random draws from a GP prior, two of which are not contained 
in the [—a, a] interval. The ratio of functions contained in the interval is equal to the 
normalizing constant of 

p(x) = |jJl[K| <a]AA(x; 0, K) . (34) 
n 

The fraction of samples from the GP prior that lie inside [—a, a] shouldn't change as the 
GP is sampled at increasing granularity of inputs s. As Figure 6 illustrates, the MCMC 
estimate of log Z converges to a constant as A — )• oo. The EP estimate log Zep, on the other 
hand, diverges to — oo. (The cumulants that are required for the correction in Equation 
(28), and recipes for deriving them, are given in Appendix E.l.) Of course the correction 
also depends on the value a chosen. Figure 7 shows that for both a — > and a — > oo the 
correction is zero for large N. 

An intuitive explanation, due to Philipp Hennig, takes a one-dimensional model p{x) = 
< a] Af(x; 0, 1). A fully-factorized approximation therefore has — 1 redundant 
factors, as removing them doesn't change p{x). However, each additional < a] truncates 
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Figure 6: Samples from a GP prior with kernel k(s,s') = exp{— \s — s'\/2£} with £ = 1, 
two of which are not contained in the [—a, a] interval, are shown top left. As 
N increases in Equation (34), with s n G [0,1], log^Ep diverges, while logZ 
converges to a constant. This is shown top right. The +'s and x's indicate the 
inclusion of the fourth (+) and fourth and sixth (x) cumulants from the 2 rd order 
in Equation (28) (an arrangement by total order would include 3 rd order C4-C4-C4 
in x ) . Bottom left and right show the growth for 2 rd order C4 correction relative 
to the exact correction. 



the estimate, forcing EP to further reduce the variance of q(x). The EP estimate using TV 
factors I[|x| < a\ l l N is correct (see Appendix C for a similar example and analysis), even 
though the original problem remains unchanged. Even though this immediate solution 
cannot be applied to Equation (34), the redundancy across factors could be addressed by 
a principled junction tree-like factorization, where tuples of "neighboring" factors can be 
co-treated. Although beyond the scope of this paper, Appendix A gives a guideline on how 
to structure such an approximation. 

8. Ising model results 

This section discusses various aspects of corrections to EP as applied to the Ising model — a 
Bayesian network with binary variables and pairwise potentials — in Equation (3). 
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Box width a, with |x| < a 



Figure 7: The accurateness of log^Ep depends on the size of the [—a, a] box relative to £, 
with the estimation being exact as a — > and a — > oo. The second order correction 
for Figure 6's kernel is illustrated here over varying a's. The +'s and x 's indicate 
the inclusion of the 4 th (+) and 4 th and 6 th (x) cumulants in Equation (28). Of 
these, the top pair of lines are for N = 100, and the bottom pair for N = 50. 



We consider the set-up proposed by Wainwright and Jordan (2006) in which N = 16 
nodes are either fully connected or connected to their nearest neighbors in a 4-by-4 grid. 
The external field (observation) strengths 0j are drawn from a uniform distribution 9i ~ 
U\— d bs> ^obs] with d bs = 0.25. Three types of coupling strength statistics are considered: 
repulsive (anti- ferromagnetic) Jij ~ U[— 2d coup , 0], mixed J. L j ~ U[— d coup , +d conp ], and at- 
tractive (ferromagnetic) ~ U[0, +2d coup ]. 

Previously we have shown (Opper and Winther, 2005a) that EP/EC give very compet- 
itive results compared to several standard methods. In Section 8.1 we are interested in 
investigating whether further improvement is obtained with the cumulant expansion. In 
Section 8.2, we revisit the correction approach proposed in (Paquet et al., 2009) and make 
and empirical comparison with the cumulant approach. 

8.1 Cumulant expansion 

For the factorized approximation we use Equations (26) and (29) for the log Z and marginal 
corrections, respectively. The expression for the cumulants of the Ising model is given 
in Section 4.2.1. The derivation of the corresponding tree expressions may be found in 
Appendices B and E.4. 

Table 1 gives the average absolute deviation (A AD) of marginals 



AAD = — ^ \P( x i = 1 )~ P( x i = l|method) = — ^ 



m,- — m? st I 



while Table 2 gives the absolute deviation of log Z averaged of 100 repetitions. In two cases 
(Grid, d conp = 2 Repulsive and Attractive coupling) we observed some numerical problems 
with the EC tree solver. We also observe that the accuracy of the solution for log Z is quite 
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Table 1: Average absolute deviation (AAD) of marginals in a Wainwright- Jordan set-up, 
comparing loopy belief propagation (LBP), log-determinant relaxation (LD), EC, 
EC with I = 4 second order correction (EC c), and an EC tree (EC t). Re- 
sults in bold face highlight best results, while italics indicate where the cumulant 
expression is less accurate than the original approximation. 



Problem type 


AAD marginals 


Graph 


Coupling 


^coup 


LBP 


LD 


EC 


EC c 


EC t 


Full 


Repulsive 


0.25 


.037 


.020 


.003 


.0006 


.0017 


0.50 


.071 


.018 


.031 


.0157 


.0143 


Mixed 


0.25 


.004 


.020 


.002 


.0004 


.0013 


0.50 


.055 


.021 


.022 


.0159 


.0151 


Attractive 


0.06 


.024 


.027 


.004 


.0023 


.0025 


0.12 


.435 


.033 


.117 


.1066 


.0211 


Grid 


Repulsive 


1.0 


.294 


.047 


.153 


.1693 


.0031 


2.0 


.342 


.041 


.198 


.4244 


.0021 


Mixed 


1.0 


.014 


.016 


.011 


.0122 


.0018 


2.0 


.095 


.038 


.082 


.0984 


.0068 


Attractive 


1.0 


.440 


.047 


.125 


.1759 


.0028 


2.0 


.520 


.042 


.177 


.4730 


.0002 



sensitive to the accuracy of the expectation consistency equations. It might be some cases 
that a solution does not exist but we ascribe numerical instabilities in our implementation 
as the main cause for these problems. It is currently out of the scope of this work to come 
up with a better solver. We choose to report the average performance for those runs that 
could attain || (a;) - (x) pi || 2 < 1(T 20 . This was 69 out of 100 in the mentioned cases and 
100 of 100 in the remaining. 

We observe that for the Grid simulations, the corrected marginals in factorized ap- 
proximation are less accurate than the original approximation. In Figure 8 we vary the 
coupling strength for a specific set-up (Grid Mixed) and observe a cross-over between the 
correction and original for the error on marginals as the coupling strength increases. We 
conjecture that when the error of the original solution is high then the number of terms 
needed in the cumulant correction increases. The estimation of the marginal seems more 
sensitive to this than the log Z estimate. The tree approximation is very precise for the 
whole coupling strength interval considered and the fourth order cumulant in the second 
order expansion is therefore sufficient to get often quite large improvements over the original 
tree approximation. 

8.2 The e-expansion 

In (Paquet et al., 2009) we introduced an alternative expansion for R in Equation (13), 
based upon a finite series expansion using 

i \ Qn{ x n) _ -. 

q{Xn) 
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Table 2: Absolute deviation log partition function in a Wainwright-Jordan set-up, compar- 
ing EC, EC with / = 4 second order correction (EC c), EC with a full second order 
e expansion (EC ec), EC tree (EC t) and EC tree with I = 4 second order correc- 
tion (EC tc). Results in bold face highlight best results. The cumulant expression 
is consistently more accurate than the original approximation. 



Problem type 


Absolute deviation log Z 


Graph 


Coupling 


^coup 


EC 


EC c 


EC ec 


EC t 


EC tc 


Full 


Repulsive 


0.25 


.0310 


.0018 


.0061 


.0104 


.0010 


0.50 


.3358 


.0639 


.0697 


.1412 


.0440 


Mixed 


0.25 


.0235 


.0013 


.0046 


.0129 


.0009 


0.50 


.3362 


.0655 


.0671 


.1798 


.0620 


Attractive 


0.06 


.0236 


.0028 


.0048 


.0166 


.0006 


0.12 


.8297 


.1882 


.2281 


.2672 


.2094 


Grid 


Repulsive 


1.0 


1.7776 


.8461 


.8124 


.0279 


.0115 


2.0 


4.3555 


2.9239 


3.4741 


.0086 


.0077 


Mixed 


1.0 


.3539 


.1443 


.0321 


.0133 


.0039 


2.0 


1.2960 


.7057 


.4460 


.0566 


.0179 


Attractive 


1.0 


1.6114 


.7916 


.7546 


.0282 


.0111 


2.0 


4.2861 


2.9350 


3.4638 


.0441 


.0433 



0.2 1 — , , , , 1 2.5 




Coupling strength Coupling strength 

Figure 8: Error on marginal (left) and logZ (right) for grid and mixed couplings as a 
function of coupling strength. 



as the small quantity instead of higher order cumulants. R has an exact representation with 
2^ terms that we may truncate at lowest non-trivial order: 



R = ( II (1 + e n {x n )) \ » 1 + ^m(x m )e n (x n )) + 0(e 3 ) . 

\ n I g ( x ) m<n 
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The linear terms are all equal to one because (%j^-y / = / g(x n ) dx n = 1 and since 



q n (x n ) is a binary distribution the quadratic term becomes a weighted sum of ratios of 
Normal distributions: 

(•Cm) \ \ ^ 1 ~i~ X m Ul m 1 + X n TTl n q(x m , Xn) 



li x m) / qM „. r-L.-, 2 2 q{x m )q(x n ) ' 

The final expression for the lowest order approximation to R is then 

R 1 1 + x m m m 1 + x n m n q(x m ,x n ) N(N - 1) 

haJUl 2 2 2 

From Table 2 we observe an improvement over the original factorized approximation and re- 
sults similar to the cumulant correction to the factorized approximation for all settings. The 
e-expansion may also used to calculate marginals and applied to generalized factorizations. 
These topics will be studied elsewhere. 



9. Future directions 

Corrections to Gaussian EP approximations were examined in this paper. The Gaussian 
measure allowed for a convenient set of mathematical tools to be employed, mostly because 
it admits orthogonality of a set of polynomials, the Hermite polynomials, which allowed a 
clean simplification of many expressions. Sofar we have restricted ourselves to expansions 
to low orders in cumulants. Our results indicate that already these first corrections to EP 
can provide useful information about the quality of EP. Small corrections typically show 
that EP is fairly accurate and the corrections improve on that. On the other hand, large 
corrections indicate that the EP approximation performs poorly. The low order corrections 
can yield a step in the right direction but in general their result may not be trusted and 
alternatives to the Gaussian EP approximation should be considered. It will be interesting 
to develop similar expansions to EP approximations with other exponential families besides 
the Gaussian one. 

Can we expect that higher order terms in the cumulant expansion will give more reliable 
approximations? Before such a question could be attacked one first would need to decide 
in which order the terms of the expansion should be evaluated in order to obtain the most 
dominant contributions. For example, we might think of trying to first compute all terms 
in the second order expansion of the exponential in Equation (26), and then move on to 
higher orders. An alternative is to sort the expansion by the total sum of the orders of 
cumulants involved. This is in fact possible by introducing a suitable expansion parameter 
(which is later set equal to one) such that the formal Taylor series with respect to this 
parameter yields the desired expansion. However, it is not clear yet if and when such a 
power series expansion would actually converge. It may well be that our expansions are 
only of an asymptotic type (Boyd, 1999) for which the summation of only a certain number 
of terms might give an improvement whereas further terms would lead to worse results. 

We expect that such questions could at least be answered for toy models such as the 
Gaussian process in a box model of Section 7.3. Our results for the latter example (together 
with the related uniform noise regression case) indicates that EP may not be understood 
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as an off the shelf method for approximately calculating arbitrary high dimensional sums 
or integrals. One may conjecture that its quality strongly depends on the fact that such 
sums or integrals may or may not have an interpretation in terms of a proper statistical 
inference model which contain data that are highly probable with respect to the model. It 
would be interesting to see if one can develop a theory for the average case performance of 
EP under such statistical assumptions of the data. 

Appendix A. Factorizations - Gaussian examples 

As p(x) is a latent Gaussian model, the g-terms in Equation (5) are chosen in this paper to 
give a Gaussian approximation 

g(x) = -^exp{A r <Xx)} = AA(x; ^E) . 

The sufficient statistics </>(x) and natural parameters A of the Gaussian are defined as 

0(x) = (x,-^xx T ) and A = (7, A) , 

where A T t/>(x) = 7 T x — \ tr[Axx T ] = 7 T x — ^x T Ax. There exists a bijection between the 
canonical parameters fi and E and natural parameters, such that the mean and covariance 
can be determined with I] = A -1 and /1 = S7. 

In Equation (1) we can define <?o( x ) = exp{Ag^(x)}, where Ao = (7^, A^ )), such that 
it is essentially a rescaling of factor fo. In the Ising model in Equation (3), this means that 
A(°) = J and -y^ = 0. In the Gaussian process classification model in Equation (2), this 
implies that A^ = K 1 and 7 (0) = 0. 

A.l Term-wise factorizations 

It remains to define a suitable factorization for the term-product JT n t n (x n ). This factoriza- 
tion can be fully factorized, factorized over disjoint sets of variables, factorized as a tree, or 
follow more arbitrary factorizations (see the simple example in Appendix C). A few such 
factorizations are given below in increasing orders of complexity. In each case we do not 
include the fo factor for clarity. Furthermore, even though the term factorization may be 
chosen to fully factorize, q(x) may be fully connected through the inclusion of fo. 

A. 1.1 Fully factorized 

A common factorization of Y\ n tn(x n ) is to set / n (x) = t n (x n ). The natural parameters 
of g n (x) = exp{A^</>(x)} are chosen to be A n = (7^ , Ann), corresponding to (j) n (x n ) = 
(x n ,— \x 2 ^). For clarity the other 7 and A parameters in A n are not shown, as they are 
clamped at zero. This gives an approximation g(x) that is defined by A = Ao + Yin ^n- 

A. 1.2 Factorization into disjoint pairs 

As a second step the variables can be subdivided into disjoint pairs x ff = (x m ,x n ). The 
factorization over terms couples pairs of variables through 

~[t n (x n ) = [t m {x m )t n (x n )] = Y[U( X ) ■ 

n 7r=(m,n) f 
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In this case each factor will have a contribution g„ (x) = exp{A^(x)} to the overall ap- 
proximation, and, as g n is a function of two variables, it is parameterized by the "correlated 
Gaussian form" A^ = (jm\ln ,^mm,Mm By symmetry = Ami- The result- 

ing q(x) is defined in terms of these disjoint sets with A = Ao + Yin A-n-. 

A. 1.3 Tree-structured factorization 

A tree structure factorization can be defined by extending the above "disjoint pairs" case 
to allow for overlaps between terms. Let Q define a spanning tree structure over all x, and 
let r = (m, n) G Q define the edges in the tree. Let d n be the number of edges emanating 
from node x n in the graph. Through a clever regrouping of terms into a "junction tree" 
form with 

n± t \ IlT=(m,n)[ t m(gm)*n(gn)] _ Ut M*) /„ R x 

the term-approximation will be tree-structured. In this example the D a powers are 1 for 
edge factors f T and (1 — d n ) for node factors f n . Let <? T (x) and g n (x.) be parameterized by 
A T and A n , as was done in the two examples above. Using 

rWx) n.exp{A^(x)} 



n^n(x)^- 1 n„exp{A^0(x)}^-i ' 
the resulting q(x) has parameter vector A = Ao + Y1 T ^ T ~ J2 n (dn — l)A n . 
Appendix B. Tree-structured approximation 

Let the factorization of the term-product Y\ n tn{x n ) take the form of a tree Q with edges 
t = (m,n) € Q, as is described in Appendix A. 1.3. The number connections to a node or 
vertex n shall be denoted by d n . From Equation (31) the second order expansion is 



1 °SR = \ E ((M(rr')) + ^(l-dm)(l-d„)((r m )(r n )) 

+ £(1 - d n ) «r T ) (r n )) + ± £(1 ~ d n ){-d n ) Ur n f) + • • ■ , (37) 



2 

n 



where the inner expectations are over k T |x and k n \x, while the outer expectations are over 
x. 5 The edge-edge, edge-node, and node-node expectations that are needed in Equation 
(37) are given in the following three sections. 



5. Some readers might wonder why there is no ^ associated with the sum over (t, n) in Equation (37). In 
the other quadratic sums, for example over m 7^ n, each (m, n) pair appears twice, as r m r n and as r n r m . 
Each edge-node pair makes only one appearance in the sum; if the sum double-counted by including 
node-edge pairs, a division by two would have been necessary. 
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B.l Edge-edge expectations 

The edge-edge expectation provides a beautiful illustration of the combinatorics that may 
be involved in Wick's theorem. For r^T / , the following expectation needs to be evaluated: 



«r T (k T ))<rv(M» = (EE^ 



i Z>3 s>3 



E (k: 



k-7-lx 



\a\=l 



E 

, |«'|=s 



a' 



(38) 

The vectors a that are summed over to get \a\ = I are a = (0, 1), (1,1 — 1), . . . , (I, 0); let 
a = (a±, I — a{) when \ct\ = I. From the independence of k T |x and k T '|x, 



<k?) kT|x (k?,) M 



'/k x ,k T ,|x 



/1 T „ ft, «, , 



a-, , s-a, 

/ k T ,k , 



(39) 



and therefore ((r T )(r T /)) = ((r T ?v)) whenever t/t'. 

Wick's theorem is again instrumental in computing (k"k" ), as all possible pairings of 
the random variables k T = (k T1 ,k T2 ) and k T / = (k T ',k T /) need to be included. As (fe^) = 0, 
(k n k T2 ) = 0, (A;^,) = 0, and (k T ^k T ^) = 0, the only non-zero expectations in the Wick 
expansion of Equation (39) occur when all the variables in k T and k T / are paired. This 
immediately means that (fe^ 1 fe^"" 1 re" 1 k s , Ql ) = whenever I ^ s, as there will be some 
remaining variables in k T (or k T /) that can't be paired and have to be self-paired with zero 
expectation. 

Given I = s, evaluate the expectation in Equation (39). We introduce the "pairing 
count" vector (3 with elements f3j G No and constraint Ylj=i Pi = I- Let f3\ count the 
number of pairings of k T1 with fc r / , and /?2 count the number of pairings of k T1 with k T ^ . As 
there are ol\ k T1 terms, the sum of its outgoing pairings should equal ct\ with 

Pi + fh = ati . 

A furthermore requirement is that 

Pl+Pz = ot'i . Ps + At = ol 2 , P2 + Pa 



"2 , 



where 02 = I — ol\ and a' 2 = I — a' 1: and /?3 and /34 be as in the Wick expansion below. 
Define B to be the set of all such (3's, and let C(f3) count the number pairing configurations 
for a given pairing j3. From Wick's theorem the expected value is equal to the sum over all 
possible pairings (3: 



h,ai v.a.1 u a 'i h. a 2\ 

K T! K T 2 K T [ K T 2 I 



k T ,k , 



E c(p) (k T1 k T ,fi(k T1 k T ,f 2 (k T2 k Tl f*(k T2 k T ,y 



A simple scheme to enumerate all (3 G B is to let 



ai - (3i, a[ - Pi, (1 + Pi) - (oi + a[) 



so that (3 £ B for each j3i G {max(0, (a\ + a\) — /),..., min(ai, a^)}. The remaining 
components of (3 are uniquely determined from j3i. 
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B.l.l Counting pairings 

How many pairings C((3) are there? 

1. There are (^) ways of choosing fi\ k T1 's, and then ^T^gni ways of choosing k T ^ to 
pair with. 



2. This leaves a remaining (a\ — f3\) k T1 's, that need to be paired with (I — ct^) k T >'s. 
((j-ay-toa-ft))! 



There are J! J^L^, such pairings. 



3. There are also — /?i remaining fe^'s, that need to be paired with k T2 variables. 
There are (Jr^l) ways of picking a fc T2 , and a further (a' t — ways of arranging 
the remaining k T ^ . 

4. Finally, the (I — a^) — (a± — /3±) remaining k' T2 s need to be coupled with the remaining 
k T r's, and there are ((I — a{) — (a± — /?i))! such arrangements. 

Multiplying the possible pairings from the four steps above gives 

c <« - G:) K^kji (i - a) «* ; - ft)! «' + ft) - + ° ;)) ' 

which adds up to the total number of possible pairings S/3gB^(/^) = ^- A further useful 
simplification is C{j3) / ol\ol'\ = 1//3! when |a| = | oc' | = I, and is used below. 

B.1.2 Edge-edge expectation 

The absence of any self-interacting loops from Wick's theorem lets the X] s >3 drop away in 
Equation (38), as all terms are zero except for when I = s. Substituting (k"k"') and C(j3) 
into Equation (38) gives the final result, 

«rv(k T ))<?v(M» (40) 

~ajfo T i k T ^ 1 (k Tl k T ^ 2 {k T2 k T i^' A {k T2 k T ^ 



l>3 \a\=l\a'\=l 



B.2 Edge-node expectations 

The derivation for the edge-node expectations is similar to that of the edge-edge case, 



((r r (k T )> (r n (k n ))) = £ ^ (K)^(K) knl , 

\!>3 s>3 \ a \=l ' ' I 

= ^^(—1)' Y2 TT| {k Tl k n ) 1 {k T2 k n ) 1 

Z>3 |a|=Z 
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where the expectations in the last line are again over {k T , k n }. When (k" k^} is evaluated 
with Wick's theorem, there are a\ copies of k Tl , I — a± copies of k T2 , and s copies of k n . 
The zero covariance of k T and k n ensures that the only non-zero terms in the Wick sum 
are those where all the &v's are paired with & n 's; in other words, when / = s. There are V. 
possible pairings, which cancels l\ in the denominator. 

The above edge-node expectation is for any edge and node in the tree, but notice that 
it simplifies greatly when the edge r is a connection to node n. Say t\ is the edge variable 
corresponding to x n . In this case the covariance with respect to the opposite pair is zero, 
with {k T2 , k n ) = (see Figure 2) and only one of the q's will have a non-zero contribution 
to the sum, namely when a = (1,0). 

B.3 Node-node expectations 

The node-node expectation is given in Equation (27), and is also used for ((r n ) 2 ). 6 



Appendix C. A tractable, one-dimensional example 

The following example illustrates a tractable one-dimensional model with two factors. It 
is shown analytically that the correction to log Zep must be zero, and that the result is 
reflected in the higher-order terms in Equation (31), which are also zero. 
Consider the factorization of a probit term with a Gaussian prior into 

p(x) = ^<5>(x)N(x;0,l) = ^fa(x) l ^ 2 f b (x) 1 / 2 M(x;0,l) , 

where <l?(x) is the cumulative Gaussian density function, and f a {x) = fb(x) = §(x). Z can 
be computed exactly, but for the sake of example p(x) will be approximated with 

0(aO = daix) 1 ^ 2 gb{x) l / 2 N (x\ 0, 1) = N(x] fj,, a 2 ) . 

Choose g a {x) = exp{^(x) T A a }, and gt{x) = ex.p{<ft(x) T \b} . The q approximation has 
parameter vector A = Ao + ^A a + ^A;,. The EP fixed point is defined by A a = At and 
Z a = Zb- (For example, subtracting A a at the fixed point will leave \\ a = Xq + 0, which is 
equal to a scaled version of the prior fo(x). The factor f a (x) = <&(x) is hence incorporated 
into the prior, giving Z a . By a symmetric argument, Z a = Zb-) Although it is trivial to 

l/ 1 ^ 1/2 

show that Zep = Z q Z a Z b will be equal to the true partition function Z, we shall prove 
it by showing that the correction term is log R = 0. 



C.l Analytic correction 

In this section a transformation of variables from x to y ~ M(y; 0, 1), with y = (x — fJ>)/o~, 
will be used to make the derivation slightly simpler, and therefore 

k a ; - — , o-" 2 j and k b \y ~ M ( k b ; - — , o~ 2 \ . 

6. Due to the square in ({r n ) 2 k | x ) , the inner average (r n ) k | x should first be computed to give an expansion 
over Hermite polynomials in x n — [i n . An example of such a result is given Appendix C. The orthogonality 
of these polynomials over q(x — /i.) allows {{r n ) 2 k i x ) to also reduce to Equation (27). 
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Below we analytically show that the correction log R is zero, and hence that 

R= // e r a (k a )\ 1/2 / e n(h)\ 1/2 \ = /yj^yyj^)\ =lj (4 i) 

\ ^ ' k a \y \ I kb\y / y ^ ' y 

where Taiu) is a shorthand for k a \y an< ^ 

l>3 ' l>3 

Because f a = ft,, the cumulants will be the same for all I, hence c a i = c^. Furthermore, 
k a \y and are both distributed according to the same density. Now define, using e ra = 
1 + r a + \r\ + ■ ■ ■ , 

*>(y) = (i + E l T k « + \ D + • • • 

\ Z>3 ' «,s>3 

- 6 + E f (;)'<* + + 5 E 1ST (;)"* + il "' +< + ■ ■ ) 

\ Z>3 V 7 Z,s>3 V 7 / u 

= i + E f (?)' *.<»> + 5 S If + ' ' ' < 42 > 

«>3 v 7 Z,s>3 v 7 

In the second line above a transformation of variables was made in the integral, with u = 
ak a + iy, such that k a = (u — iy)/o~. The Jacobian 1/a ensures proper normalization so 
that the average is over u ~ N{u; 0, 1). In the last line Hi(y) is the Hermite polynomial of 
degree I, 

H (y) = l H 1 (y) = y H 2 (y) = y 2 - 1 

H 3 (y) = y 3 -3y H 4 (x) = y 4 - 6y 2 + 3 H 5 (y) = y 5 - 10y 3 + 15y ••• 

which can be obtained for any real y and integer I = 0, 1, 2, . . . from the average Hi(y) = 
((y + iu) 1 ) over u ~ N{u; 0, l). 7 

The remarkable property (Hi(y)) = for all I, ensures that {J~ a (y)) y = 1 in Equation 
(42). Furthermore, T a (y) = Fb{y) follows from the equivalence in cumulants c a i = cm; the 
roots in Equation (41) disappear to give (T a (,y)) y i proving that R = 1 in Equation (41). 

C.2 Second order correction 

The second order expansion in Equation (31) in Section 6 evaluates to zero, as the matching 
cumulants c a i = cu and equal distributions of k a \x and k b \x ensure that {r a (ka)) k a \ x = 

( r b(h)) kblx - 

lo g R =\ ((r a (k a )) kalx (r b (k b )) kblx ) x - i (((r a (k a ))t alx ) x + ((r b (h))l blx ) J + ■ ■ ■ 

7. When J-(y) in Equation (42) is rearranged as a power series in a 1 , we obtain an Edgeworth expansion to 
arbitrary order I. The deviation from the Gaussian q(y) is thereby factorized out of tilted distribution 
with q a (y) = q(y)J-(y). The interested reader is pointed to Blinnikov and Moessner (1998). 
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= l(MM)L,^-^(2((r a (^)L|,)J+--- 
= () + ••• . 



Appendix D. Corrections to marginals distributions 

Corrections to the marginal distributions follow from a similar derivation to that of the 
normalizing constant. As a simplification, let the Gaussian approximation be centred with 
y = x — /x, so that q(y) = j\f(y; 0,S), and assume that q(x) is arises from the fully 
factorized approximation in Section 5. In this appendix corrections will be computed for the 
mean fa - /^) p(x) = (yi) p{y) , and variance (fa - fJ,i)fa - fij) - £*j) p(x) = (ViVj)^) ~ Ey- 
A further simplification that will be employed in the following section is a change of 
variables i] n = k n + iS~^y n , so that % ~ M(r] n ; 0, Let 

— _ 'V" 1 

Zn — f]n l^nnVn , 

which is zero-mean complex Gaussian random variable with a covariance (zfy = and 
(z m z n ) = — S mn /(S mm S nn ) when m ^ n. Following Equation (24), the correction reads 



\ n I y \ n I y 



cxp 



D.l The marginal mean 



The lowest order correction to the EP marginal's mean follows from the result in Equation 
(14): 



fo> P (y) = ^(w e^ 2 "^ 



+ 



^ Yl S< J \ ^7 ( 1 + X] rn ^ n ) + \ Yl r m{Zm)r n {Zn) 
j \ ^ \ n m,n 

In the above expansion the first order term is q^ 3 = q z gjp = ~^Jj dz ' an< ^ disap- 
pears as = 0. The j = n second order term also disappears as = 0- 




These equivalences can be seen by taking rj(Zj) (and also its derivative) as a expansion 
over powers of ^; as (Zj) = 0, Wick's theorem states that every expectation of powers of 



should be zero. Hence 
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The derivative of the characteristic function, as required in Equation (43), is 



drj{z 3l 



dZA 



d_ 

dz~ 



l>3 



l\ i 



l>3 



d-1 C lj J-l 

((-1)!' 



Z>2 



The expectations for j ^ n in Equation (43) evaluate to 



dZj 



,l>3 



i ^2 l ~ 

Z>3 



l\s\ 

■21 c l+l,j c ln I J J 



s>3,l=2 



2!s! \ 3 n 



(i\y 



3 n ' 



(44) 



with the second term disappearing as s > I = 2 ensures that some z n is always self-paired 
in Wick's theorem. Finally, by substituting Equation (44) into (43), the correction to the 
mean is 



p(y) 



EEff 

l>3 j^n JJ 



Ejj Q+lj'Qn / E 



/! 



EjjE nn 



(45) 



D.2 The marginal covariance 

The correction to the second moments follow the same recipe as that of the marginal mean 
in Appendix D.l. We proceed by first treating yi with 



— Ei (Si. e^™ 1 "™^™) 4- v-^—e^ 



Tn (^n ) 



i(Zn) 



Reapplying the recipe gives the correction to the covariance: 



ki 



9 dr l( Z lK-E n rn(z n )\ + 



dyt dzi 



E 

kl 



Ej/ Ejfc 
Ezz Ekk 



d 2 ri(zi) dr k (z k ) dri(zi) 
Okl — ~ o H 



dz 



EE 



EiiEji c s kC s+ 2,i ( Eki 



s>3 fc^Z " 



EfcfcEz 



<9z fc <9z; 

+ EE 

s>3 fc^Z 



e 



Ej; Ejfc c s kC s i 



-ki 



s-l 



+ 



34 



Perturbative Corrections for Approximate Inference 



Appendix E. Higher order cumulants 

Much of this paper hinges on cumulants beyond the second order. These are frequently more 
cumbersome to obtain than the initial moments that are required by EP. This appendix 
provides details of the cumulants used in this paper. 

The cumulants of a distribution q n (x) can be obtained from its moments through 

c 3 = (x 3 )-3(x 2 )(x)+2(x) 3 

c 4 = ( x 4 ) _ 4 (x 3 ) (x) - 3 (x 2 ) 2 + 12 (x 2 ) (x) 2 - 6 {x) 4 

c 5 = ( x 5 ) - 5 <x 4 > (x) - 10 (x 3 ) (x 2 ) + 20 (x 3 ) (x) 2 + 30 {x 2 f (x) - 60 (x 2 ) (x) 3 + 24 (xf ; 

they are derived for doubly-truncated Gaussian distributions in Appendices E.l and E.2. 
One might also directly take derivatives of the cumulant generating function, and the cu- 
mulants of a Probit-times-Gaussian distribution, common to GP classification models, are 
derived this way in Appendix E.3. 

The tree-structured approximation in Sections 6 and 8.1, and Appendices A. 1.3 and B, 
require cumulants over two variables. They are presented in Appendix E.4 for the Ising 
model. 



E.l Doubly truncated centered Gaussian 

Consider the centered distribution q n (x n ) cx I[|x n | < a] M{x n ; 0, A" 1 ). The odd moments 
of this tilted distributions are, by symmetry, (x n ) = (x 3 ) = (x^) = 0. Let 

/ \ [' a 1 2 

Z n = 2\ — e~ Ax dx = 2$0) - 1 , z = V\a, 
V 2vr J 

with the Probit function being <£(x) = J"^ N(z\ 0, 1) dz. Subscripts n are dropped where 
they are clearly implied by their context. To get the even moments, consider 



At = d x log Z n = d x log U\ j" 1 dx =~\ 
A 2 = d 2 lo g Z n = -^- 2+ \((x 4 )-(x 2 ) 2 



2 x ' 



Using the partition function, we get 



Ai = 9 A log(2*(z)-l) 
A' 



( M{z) 



y/\ \2$(Z) - 1, 

a 2 ( zN{z) \ a ( N{z) \ a 2 ( N{z) 



and thus 



2\ \2<5>(z) - 1 J 2A 3 / 2 \2§{z) - lj A \2*(«)-l 



{x 2 ) = (46) 
(x 4 ) = y 2 + (x 2 f + ±A 2 . (47) 
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Figure 9: The moments of q n (x) oc I[\x\ < a]Af(x; n, a 2 ), as a function of a 2 . As the 
Gaussian variance a 2 — > oo, the moments converge to that of a uniform U[—a, a] 
distribution. 



We can further determine ^3 = c?? log Z n using the partition function, giving 
A 3a f M{z) \ | 3a 2 / zM{z) \ 3a 2 / M{z) \ 2 | 2a 3 ( M{z) 



4A 5 /2 \2<$>{z) - lj A\ 2 \2<S>(z)-lJ 2\ 2 \2<S>(z) - 1 J A 3 / 2 \2$(z) - 1 
3a 3 / M{z) \ / zjV(z) \ a 3 f(z 2 -l)A/» 



2A 3 / 2 \2§{z) - 1 / V2^(^) - I J 4A 3 / 2 V 2*(z) - 1 
Therefore 

(x fl ) = A + ( x 2 ) (^) _ 2 (x 2 ) 3 + 2 (x 4 > (x 2 > - 8A3 . (48) 
E.2 Doubly truncated non-centered Gaussian 

The same calculation from Appendix E.l can be repeated to get the moments of the non- 
centered truncated Gaussian q n {x n ) oc I[\x n \ < a]M{x n ; fi, A" 1 ). The subscripts n are 
dropped where evident. The partition function is 

Z(A,m) = \[^[ e-^ ( *-" )a dx = $(z max ) - $(z min ) , 

where 

Zmax = v/A()ti + a) , and z min = - a) . 

By again taking increasing derivatives of Z(X, fi) with respect to \i and A, the moments 
solved for are 

( X )=g+ J_ ^max)-^min) , , 

-max 

) - $(2 min ) 

/ 2\ <-> / \ , 1 2 1 ^maxA (^max) ZramJ* (^min) / r „\ 

(x) = 2(x) M+ -- y -- $(w) _^ mh) (50) 
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Cumulant C3 with m = and v = (step function) Cumulant C4 with m = and 1; = (step function) 

lo.s 






, 1 





0.4 
0.2 


-0.2 
0.4 
-0.6 



Figure 10: The third and fourth cumulants of the density q n (x) oc Q((x — m)/v)J\f(x;fj>,a 2 ) 
in Appendix E.3. The step function Q(x), with m = v = 0, is taken as an 
example here. The third cumulant is always positive, while the fourth cumulant 
is positive only when a > fi. 



3 o 2 
--3/, 



(x 3 ) =3(x 2 )n + {x) 

1 (1 " A/"(z max ) - (1 - zj in ) AA(z min ) 



3 3 
A^ + /? 



A 3 / 2 

(x 4 ) = A(x 3 )fi + (x 2 ) 



^(^max) - ^min ) 



A 



6/T 



+ <x) 



A 



A 



/'" 



M 4 + 



A 2 



1 ^max(l "I - ^ m ax) -A/*(^max) ^min(l "I - ^min) -^"(^min) 
A 2 ^(^max) - $(Zmin) 



Finally, 

(x 5 ) = 5(x*)fi + (x 3 ) 
6 



A 



10^ 2 



+ <- 2 > 



10/T 



18 
"A 



+ {x) 



18 2 4 3 
T M -5/i 



(51) 



(52) 



+ A^ 



3 1 5 _ j; (1 + ^ z max z max) -^( z max) (1 + ^ Z min Z min) ^( z min) 

A A*/ 2 *(w)-*(*n*i) 



(53) 



As Figure 9 illustrates, these moments will converge to that of a uniform distribution as 
the Gaussian's variance grows large. 



E.3 Probit link cumulants 

EP approximations to Probit regression models, and Gaussian process classification models 
in general (see Section 7.1), depend on the moments of q n {x) oc $((x — m)/v ) M(x; /i, a 2 ). 
We introduce v > so that the likelihood can become a step function at v = 0, for example. 
We shall obtain the cumulants by taking derivatives of the characteristic function. The 
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characteristic function of q n (x), as described by Equation (16), is 



Xn (k) = (e ikx ) 



Qn{x) 



1 



2_2 



exp < ikfj, — —k a 



with 



a — m u, + ika — m 

z = , = = and Zfe = 



y/v 2 + cr 2 



\/'i7 2 + cr 2 



The cumulants q n are determined from the derivatives of logXn(^) at zero; a lengthy 
calculation shows that they are 



c 3n = a 3 p[2/3 2 + 3zfi + z 2 - l] 

c 4n = - a 4 /3[6/3 3 + 12z/3 2 + 7z 2 (3 + z 3 - 4(3 - 3z) 



(54) 
(55) 



where a = a 1 j\Jv 2 + a 2 and /3 = Af(z;0, l)/$(z). 



E.4 Two-variable Ising model cumulants 

We need some third and fourth order two- variable cumulants and thus generalize the results 
of Section 4.2 to the bivariate case. To do this we can exploit the cumulant generating 
property of logXa(k a )- Let en ^ denote the joint I, I' order cumulant of variable one and 
two, respectively. We can generate this cumulant from derivatives of logXa(k a ): 



C(l,l>) 



d V ( d 



log Xa(K] 



dik\ ) \dik2 

We can also express this as a recursion in terms of cumulants: 



k=0 



z {l+n,l'+ri) 







dik\ 







dik-2 



C(«,/')( k ) 



k=0 



By explicit calculation for a bivariate binary distribution we get the first two orders cumu- 
lants: c(l,0) = mi, c(0, 1) = m 2 , c(2,0) = 1 — m 2 , c(0, 2) = 1 — m 2 and c(l, 1) is equal 
to the covariance between the two variables (to be matched with q(x)). The fact that we 
can write c(2, 0) in terms of the first order cumulant shows that we can express all order 
cumulants in terms of the first and second order cumulant for example: 



C (2,D=^Wk) 



8 



k=0 



dih 



(1 



'(1,0) 



00) 



-2c 



k=0 



(1.0) c (l,l) 



Using the same recursion it is easy to show: c 



(3,0) 



-2c 



(l,0) c (2,0)) c (4,0) 



2c(i,o)C(3,o), c (3,i) = - 2 C(2,o)C(i,i)-2c (li o)C (2 ,i) and C( 2j2 ) = — 2c (11) -2c (lj o)C( lj2 ) = -2c (1]1) + 



-2r 2 

ZC (2,0) 

2c? 



4c 



(1,0) C (0,1) C (1,1)- 
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